Adaptive block switching with deep neural networks

ABSTRACT

The present invention relates to a method for predicting transform coefficients representing frequency content of an adaptive block length media signal. The method comprises receiving a frame and receiving block length information indicating a number of quantized transform coefficients for each block in the frame, the number of quantized transform coefficients being one of a first number or a second number, wherein the first number is greater than the second number; determining that a first block has the second number of quantized transform coefficients; converting the first block into a converted block having the first number of quantized transform coefficients; conditioning a main neural network trained to predict at least one output variable given at least one conditioning variable, the at least one conditioning variable being based on information regarding the converted block and block length information for the first block; and providing at least one predicted transform coefficient from an output stage of the main neural network.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of priority from U.S. Provisional Patent Application No. 63/092,685, filed on Oct. 16, 2020, and EP Patent Application No. 20206462.2, filed on Nov. 9, 2020, which are hereby incorporated by reference.

FIELD OF THE INVENTION

The present invention relates to combining a generative model with existing high efficiency coding schemes for media signals. Specifically, the present invention relates to a method for predicting the transform coefficients of an adaptive block length media signal with a trained neural network.

BACKGROUND OF THE INVENTION

In low-rate adaptive block length encoding and decoding, the encoder is configured to optimize the trade-off between frequency and time resolution. This may be achieved by selecting, by the encoder, a transform length for each signal sample block. In general, the encoder will select a long block, with a higher number of transform coefficients, for signal sample blocks representing signals with slowly evolving temporal characteristics and will select a set of short blocks, each with a lower number of transform coefficients, for signal sample blocks representing signals with rapidly evolving temporal characteristics.

A problem with encoding and decoding adaptive block length signals is that the blocks to be decoded may comprise a varying number of transform coefficients representing the frequency content of the media signal over varying time durations of the media signal. Adaptive block lengths are thus incompatible with traditional decoding schemes developed for fixed block length signals. Also, it would be beneficial to obtain in the decoder a more accurate representation of the original media signal which has been sampled in the encoder to form the signal sample blocks and adaptively divided into blocks of varying numbers of transform coefficients.

GENERAL DISCLOSURE OF THE INVENTION

Based on the above, it is therefore an object of the present invention to provide a method for predicting, with a neural network, transform coefficients of an adaptive block length media signal, and in particular an adaptive block length general audio signal.

According to a first aspect of the invention there is provided a method for predicting, with a computer implemented neural network system, transform coefficients representing frequency content of an adaptive block length media signal. The method comprises receiving a block of a frame, each block of the frame comprising at least one quantized transform coefficient (or a set of quantized transform coefficients) representing a partial time segment of the media signal, receiving block length information indicating a number of quantized transform coefficients for each block of the frame, the number of quantized transform coefficients being one of a first number or a second number, wherein the first number is greater than the second number, determining that at least a first block of the frame has the second number of quantized transform coefficients, converting at least the first block into a converted block having the first number of quantized transform coefficients, conditioning a main neural network trained to predict at least one output variable given at least one conditioning variable, the at least one conditioning variable being based on conditioning information, the conditioning information comprising a representation of the converted block and a representation of block length information for the first block, and providing the at least one output variable to an output stage (output neural network) configured to provide at least one predicted transform coefficient from the at least one output variable.

As an alternative to quantized transform coefficients, the transform coefficients may be distorted or impaired. The transform coefficients outputted by the output stage (output neural network) are enhanced in the sense that they more closely resemble an original set of transform coefficients and/or that the enhanced transform coefficients, when inversely transformed into the time domain, describe a media signal which is perceived as a higher quality media signal compared to a time domain media signal described by the quantized transform coefficients. Further, a frame, as referred to herein, may include one or more blocks (e.g., a set of blocks).

The invention is at least partially based on the understanding that by converting the (short) first block into a (long) converted block with the first number of transform coefficients, the generative properties of the trained main neural network may be introduced into variable block switching decoding. As neural networks have a fixed dimension in their output layers, they are incompatible with adaptive length blocks. By converting the first block of the quantized transform coefficients into a converted block, and using a representation of the converted block and a representation of block length information to condition the main neural network, the neural network may predict the at least one (enhanced or non-quantized) transform coefficient in a dynamic manner based on block length. That is, as a representation of the block length information is comprised in the conditioning information (upon which the at least one conditioning variable is based), the main neural network will be trained to respond appropriately to a block having been converted to comprise the first number of transform coefficients.

Additionally, it may further be determined that a block of the frame comprises the first number of quantized transform coefficients. Such a (long) block need not be converted to a converted block; instead, a representation of the block with the first number of quantized transform coefficients is comprised in the conditioning information. Apart from not being converted, the long block may be treated analogously to a block determined to be short. The transform coefficients outputted by the output stage comprise the first number of transform coefficients representing either a quantized transform coefficient block with the first number of transform coefficients or a converted block of the first number of quantized transform coefficients, which in turn represents at least one quantized transform coefficient block with the second number of transform coefficients.
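By way of illustration only, the conversion of short blocks into a converted block of the first length may be sketched as follows. The sketch assumes that consecutive short blocks are merged by interleaving their coefficients and that a single short block may alternatively be up-sampled by repetition; interleaving and repetition are assumptions made for this example, and the names interleave, upsample and convert_frame are hypothetical.

```python
from typing import List, Sequence, Tuple

LONG = 256   # example "first number" of coefficients
SHORT = 128  # example "second number" of coefficients


def interleave(short_blocks: Sequence[Sequence[float]]) -> List[float]:
    """Merge equally long short blocks into one block by interleaving bins.

    Interleaving is an assumed conversion rule, not one mandated by the text.
    """
    n = len(short_blocks[0])
    if any(len(b) != n for b in short_blocks):
        raise ValueError("all short blocks must have the same length")
    # Coefficient k of every short block is placed next to each other.
    return [b[k] for k in range(n) for b in short_blocks]


def upsample(block: Sequence[float], factor: int) -> List[float]:
    """Stretch one short block by repeating each coefficient (assumed rule)."""
    return [c for c in block for _ in range(factor)]


def convert_frame(
    blocks: Sequence[Sequence[float]], long_len: int = LONG
) -> Tuple[List[List[float]], List[int]]:
    """Return fixed-length blocks plus the original length of each block."""
    converted, info, pending = [], [], []
    for b in blocks:
        if len(b) == long_len:
            if pending:
                raise ValueError("incomplete group of short blocks")
            converted.append(list(b))      # long blocks pass through unchanged
            info.append(long_len)
        else:
            pending.append(b)
            if sum(len(p) for p in pending) == long_len:
                converted.append(interleave(pending))
                info.append(len(b))        # remember the original short length
                pending = []
    if pending:
        raise ValueError("incomplete group of short blocks")
    return converted, info
```

Every block returned by convert_frame has the first number of coefficients, while the accompanying list retains the original block length so that it can be included in the conditioning information.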

As the main neural network may predict at least one transform coefficient for each of the variable length blocks in sequence, the main neural network takes temporal and/or frequency dependencies into consideration. The main neural network may have a memory function such that previous inputs affect the current processing and such that the prediction of the current at least one (enhanced) transform coefficient is influenced by earlier transform coefficients.

The adaptive length blocks represent a trade-off between frequency and time. A longer block comprises more transform coefficients and will represent a longer duration of the media signal, while a shorter block comprises fewer transform coefficients and will represent a shorter duration of the media signal.

According to a second aspect of the invention there is provided a method for obtaining at least one training block for training a computer implemented neural network system to predict at least one transform coefficient of an adaptive block length media signal. The method comprises obtaining a set of transform blocks each comprising a number of transform coefficients representing frequency content of a media signal, the number of transform coefficients in each block being a first number or a second number, wherein the first number is greater than the second number, determining that a first block comprises the second number of transform coefficients, converting the first block into a converted block having the first number of transform coefficients, obtaining a target predicted block from the converted block, quantizing the converted block, and obtaining a training block from the quantized converted block.

The obtained set of transform blocks may further represent a sequence of associated time domain window functions (short, long, bridge-in or bridge-out).
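As a non-limiting sketch of the second aspect, a training block and its target may be derived from a group of short transform blocks as shown below. The interleaving conversion and the uniform quantizer with step size step are assumptions made for this example, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def quantize(block: Sequence[float], step: float) -> List[float]:
    """Uniform quantizer (assumed); stands in for the codec's real quantizer."""
    return [step * round(c / step) for c in block]


def make_training_pair(
    short_blocks: Sequence[Sequence[float]], step: float = 0.5
) -> Tuple[List[float], List[float]]:
    """Return (training_block, target_block) for one group of short blocks."""
    n = len(short_blocks[0])
    # Convert: interleave the short blocks into one block of the long length.
    converted = [b[k] for k in range(n) for b in short_blocks]
    target = list(converted)               # unquantized converted block
    training = quantize(converted, step)   # impaired version of the target
    return training, target


training, target = make_training_pair([[0.2, 1.1], [-0.7, 0.4]])
print(training)  # quantized, converted block
print(target)    # target predicted block
```

The target is taken from the converted block before quantization, so that the network learns to undo the impairment introduced by the quantizer.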

According to a third aspect of the invention there is provided a computer implemented neural network system for predicting at least one transform coefficient representing frequency content of an adaptive block length media signal. The neural network system comprises an adaptive block pre-processing unit configured to receive a frame comprising a set of quantized transform coefficients representing a partial time segment of a media signal, receive block length information indicating a number of quantized transform coefficients for each block in the frame, the number of quantized transform coefficients being one of a first number or a second number, wherein the first number is greater than the second number, determine that at least a first block has the second number of transform coefficients, and convert at least the first block into a converted block having the first number of quantized transform coefficients. The neural network system further comprises a main neural network, wherein the main neural network is trained to predict at least one output variable given at least one conditioning variable based on conditioning information, the conditioning information comprising a representation of the converted block and a representation of block length information for the first block, and an output stage configured to provide at least one predicted transform coefficient from the at least one output variable.

In some implementations, the neural network system described above has been trained by using a set of target prediction blocks and a set of training blocks, the set of training blocks being an impaired representation of the target prediction blocks and comprising at least one training block with the first number of transform coefficients and at least one training block with the second number of transform coefficients. The set of training blocks is provided to the adaptive block pre-processing unit of the neural network system, and a set of predicted blocks is obtained from the output stage of the neural network system in response to the set of training blocks. A measure of the predicted blocks with respect to the set of target prediction blocks is computed and the weights of the neural network system are modified to decrease the measure.

By modifying the weights of the neural network system in response to the measure of the predicted blocks, the training will result in the neural network system learning to predict (generate) at least one transform coefficient from at least one quantized transform coefficient. The training will further result in the neural network system learning to properly recognize the at least one conditioning variable representing one or more short blocks and to process it in a manner such that the resulting at least one predicted transform coefficient closely resembles the at least one transform coefficient of the media signal.

It is understood that, based on acquiring the measure, the neural network system may be trained, preferably iteratively, by modifying parameters (e.g. the weights) of each neural network until a satisfactorily small measure is achieved.
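The iterative training described above may be illustrated with a deliberately small stand-in for the neural network system. In the sketch below the "network" is a single gain w and offset b applied to each quantized coefficient, the measure is assumed to be the mean squared error, and the weights are updated by plain gradient descent; none of these choices is prescribed by the present disclosure.

```python
from typing import List, Sequence, Tuple


def train(
    training_blocks: Sequence[Sequence[float]],
    target_blocks: Sequence[Sequence[float]],
    lr: float = 0.05,
    max_iters: int = 500,
    tol: float = 1e-4,
) -> Tuple[float, float, List[float]]:
    """Fit predicted = w * quantized + b and return (w, b, measure history)."""
    w, b = 1.0, 0.0                      # the only weights of this toy model
    history: List[float] = []
    for _ in range(max_iters):
        total = grad_w = grad_b = 0.0
        count = 0
        for x_blk, t_blk in zip(training_blocks, target_blocks):
            for x, t in zip(x_blk, t_blk):
                err = (w * x + b) - t    # prediction error for one coefficient
                total += err * err
                grad_w += 2.0 * err * x
                grad_b += 2.0 * err
                count += 1
        measure = total / count          # mean squared error (assumed measure)
        history.append(measure)
        if measure < tol:                # satisfactorily small measure reached
            break
        w -= lr * grad_w / count         # modify weights to decrease the measure
        b -= lr * grad_b / count
    return w, b, history


w, b, history = train([[0.0, 0.5, 1.0, -0.5]], [[0.1, 0.6, 1.1, -0.4]])
print(history[0], history[-1])
```

In an actual system the same loop structure would apply to all weights of the main neural network, the output stage and any conditioning networks, typically through automatic differentiation.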

The invention according to the second and third aspects features the same or equivalent embodiments and benefits as the invention according to the first aspect. Further, any functions described in relation to a method may have corresponding structural features in a system or code for performing such functions in a computer program product.

Experiments have been performed for encoding and decoding a reference media signal with a fixed block length and an adaptive block length. In the case of a fixed block length, a fixed length neural network system was implemented in the decoder, and in the case of adaptive block length the neural network system according to an implementation of the current invention was implemented in the decoder. The fixed block length encoding used 256 MDCT coefficient blocks and the adaptive block length encoding used adaptive 256/128 MDCT coefficient blocks. When comparing the decoded signals, adaptive block length switching with the neural network system of the present invention in the decoder showed reduced pre-echo distortion compared to the fixed block length counterpart.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be described in more detail with reference to the appended drawings, showing currently preferred embodiments of the invention.

FIG. 1 shows an adaptive block length encoder and a decoder implementing the neural network system according to embodiments of the present invention.

FIG. 2 shows a neural network system according to embodiments of the present invention.

FIGS. 3a-b show a merging process of time window functions.

FIG. 4 shows a flow chart illustrating a method for predicting at least one transform coefficient from quantized transform coefficients according to an embodiment of the invention.

FIG. 5 shows a flow chart illustrating a method for obtaining training blocks for training a neural network system according to embodiments of the present invention.

FIG. 6 shows a flow chart illustrating a method for obtaining training blocks for training a neural network system according to embodiments of the present invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

FIG. 1 depicts an adaptive block length encoder/decoder system including an encoder 1 and a decoder 2. A media signal is received at the input port, at a transient detector 101. The media signal may be divided into a series of time domain frames and may further be divided into a plurality of time domain segments wherein each segment comprises a number of media signal samples. For example, a time domain frame comprises 16000 signal samples and is divided into four segments of 4000 samples. The number of signal samples in the time domain frame and the segments (thereby also the number of segments in the time domain frame) is merely exemplary, and may be any number. The transient detector 101 is configured to optimize, for each segment, the trade-off between frequency and time resolution by selecting a transform length. In general, the transient detector 101 selects a long transform length for segments containing signals with slowly-evolving or stationary temporal characteristics and selects shorter transform lengths for segments containing signals with rapidly-evolving temporal characteristics. By optimizing ‘perceptual coding gain’ for both short and long signal classes, this approach offers a fundamental advantage over coding with time-invariant transform lengths.

Depending on the temporal characteristics of a segment of the media signal, the transient detector 101 may request that the segment be represented by a transform domain block with a first number of transform coefficients (for slowly-evolving temporal signal segments) or a plurality of transform domain blocks each comprising a second number of transform coefficients (for rapidly-evolving temporal signal segments), where the first number is greater than the second number. For example, the transient detector 101 may request that a slowly-evolving segment is represented with 256 transform coefficients X_(k) while a rapidly-evolving segment is represented with two sets (transform domain blocks) of 128 transform coefficients X_(k), or four sets of 64 transform coefficients X_(k). The number of chosen transform coefficients is not limited to the included examples, and any number may be chosen. The transient detector 101 may request a number of transform coefficients among a set of block lengths, wherein the set of block lengths comprises at least two lengths such as 256/128. In some implementations, the set of block lengths comprises three or more lengths such as 256/128/64 among which the transient detector 101 may select a suitable length for a block. For example, the transient detector 101 may request that a segment is represented by a combination of short blocks of varying lengths. For example, a slowly evolving segment is represented by 256 transform coefficients X_(k), while a following rapidly evolving segment is represented by one block with 128 transform coefficients X_(k) and two blocks with 64 transform coefficients X_(k). The transient detector 101 generates block length information which represents the requested number of transform domain blocks (and/or the number of transform coefficients X_(k) for each block) with which the time domain segments should be represented. The block length information is transmitted to the decoder 2. The transient detector 101 also passes the block length information to the transform unit 102.

The transform unit 102 transforms the segments according to the block length information and outputs the adaptive length transform blocks comprising transform coefficients X_(k) to a quantizer 103. For the example mentioned above, a 16000 sample time frame having been divided into four 4000 sample segments is transformed into a series of transform blocks with 256, 256, 128, 128 and 256 transform coefficients X_(k) respectively. These transform blocks may then form a transform domain frame (frame) in the encoder 1 and/or decoder 2. In other words, a frame may be referred to as a set of one or more transform blocks and/or one or more segments. In parts of the encoder 1 and in the decoder 2, the frame to which a transform block belongs may not be explicitly indicated or considered as it suffices to treat the transform blocks in series without regard to their respective time or transform domain frame.
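The generation of block length information for a frame of four segments, yielding the 256, 256, 128, 128 and 256 sequence of the example, may be sketched as follows. The decision rule used here (comparing the energy of the two halves of a segment against ratio_threshold) is merely an assumed stand-in for the transient detector 101, whose actual criterion is not specified in this description; the function name is hypothetical.

```python
from typing import List, Sequence


def select_block_lengths(
    segments: Sequence[Sequence[float]],
    long_len: int = 256,
    short_len: int = 128,
    ratio_threshold: float = 4.0,
) -> List[int]:
    """Return one block length per transform block for the given segments."""
    info: List[int] = []
    for seg in segments:
        half = len(seg) // 2
        # Small constant avoids division by zero for silent halves.
        e_first = sum(s * s for s in seg[:half]) + 1e-12
        e_second = sum(s * s for s in seg[half:]) + 1e-12
        ratio = max(e_first, e_second) / min(e_first, e_second)
        if ratio > ratio_threshold:
            # Rapidly evolving segment: several short blocks.
            info.extend([short_len] * (long_len // short_len))
        else:
            # Slowly evolving segment: one long block.
            info.append(long_len)
    return info


steady = [1.0] * 8
attack = [0.0] * 4 + [1.0] * 4
print(select_block_lengths([steady, steady, attack, steady]))
# -> [256, 256, 128, 128, 256]
```

The returned list corresponds to the block length information that is passed to the transform unit 102 and transmitted to the decoder 2.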

The received media signal is further received by a perceptual model 111 which computes a masking threshold. The masking threshold is passed to a bit allocation unit 112.

In the bit allocation unit 112, a bit allocation for the soon to be quantized transform coefficients is assigned based on the perceptual masking threshold information received from the perceptual model 111. The bit allocation unit 112 may allocate bits to reduce or minimize the quantization noise. The bit allocation unit 112 passes the bit allocation information to the quantizer 103.

The quantizer 103 quantizes the transform coefficients X_(k) of each block among the adaptive block length blocks by allocating bits to each transform coefficient according to the received bit allocation information, to form quantized transform coefficient {tilde over (X)}_(k) blocks. The quantizer 103 transmits the adaptive block length blocks comprising quantized transform coefficients ({tilde over (X)}_(k)) to the decoder 2.
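For illustration, quantization according to a per-coefficient bit allocation may be sketched as below. The uniform quantizer, the normalisation to max_abs and the rule that zero allocated bits yields a zero coefficient are assumptions made for the example and do not describe the quantizer 103 of any particular codec.

```python
from typing import List, Sequence


def quantize_block(
    coeffs: Sequence[float], bits: Sequence[int], max_abs: float = 1.0
) -> List[float]:
    """Quantize each coefficient with its own number of bits (assumed scheme)."""
    out: List[float] = []
    for c, nbits in zip(coeffs, bits):
        if nbits <= 0:
            out.append(0.0)               # no bits allocated: coefficient dropped
            continue
        step = 2.0 * max_abs / (2 ** nbits)
        index = round(c / step)
        # Clamp to the 2**nbits representable levels.
        lo, hi = -(2 ** (nbits - 1)), 2 ** (nbits - 1) - 1
        index = max(lo, min(hi, index))
        out.append(index * step)
    return out


print(quantize_block([0.30, -0.80, 0.55], [3, 2, 0]))
# -> [0.25, -1.0, 0.0]
```

Coefficients that receive more bits are reproduced with a finer step size, which is how the bit allocation information steers where quantization noise is placed.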

In the decoder 2, a neural network (NN) system 201 receives a frame, where each block of the frame comprises at least one quantized transform coefficient {tilde over (X)}_(k), from the quantizer 103 of the encoder 1, and block length information, from the transient detector 101 of the encoder 1. The neural network system 201 comprises a main neural network and an output stage (e.g., an output neural network) trained to predict at least one transform coefficient (the at least one predicted transform coefficient X _(k)) from quantized transform coefficients {tilde over (X)}_(k). A conversion stage of the neural network system 201 converts blocks with the second number of quantized transform coefficients {tilde over (X)}_(k) to converted blocks comprising the first number of quantized transform coefficients {tilde over (X)}_(k). In some implementations the conversion stage of the neural network system 201 merely passes on blocks with the first number of quantized transform coefficients {tilde over (X)}_(k). Accordingly, the output stage of the neural network system 201 may output a sequence of static length blocks (e.g. each comprising the first number of predicted transform coefficients X _(k)) wherein some blocks represent a quantized block of the same length and wherein some blocks represent at least one, and in some implementations more than one, short block of a different (shorter) length.

The at least one predicted transform coefficient X _(k) is received at an inverse transform unit 202 configured to transform the at least one predicted transform coefficient X _(k) of each transform domain block into time domain segments (i.e. predicted time domain segments). The inverse transform unit 202 may in some implementations receive block length information from the transient detector 101 of the encoder 1.

As described above, the at least one predicted transform coefficient X _(k) that arrives as blocks at the inverse transform unit 202 may be of a static predetermined length despite some blocks representing one or more quantized blocks of an originally (pre-conversion) shorter length. As the inverse transform unit 202 receives information of this original transform domain block length in the form of block length information, the inverse transform unit 202 may take necessary pre-inverse transform processing steps. For instance, in response to a predicted long block being associated with an originally short block which was up-sampled to form a converted block in the conversion stage, the inverse transform unit 202 may down-sample the predicted long block to a predicted short block prior to inverse transforming the short block to the time domain. In another example, at least two short blocks with quantized transform coefficients {tilde over (X)}_(k) are converted into a single converted block in the conversion stage and are predicted by the neural network system as a single long block of at least one predicted transform coefficient X _(k). In such a case, the inverse transform unit 202 may determine from the block length information that the predicted long block is in fact a prediction based on at least two short blocks (which have been combined) and in response perform pre-inverse transform processing steps, such as splitting or performing an inverse conversion procedure, i.e. the inverse of the conversion carried out in the neural network system 201, to obtain predicted blocks of the same length as determined by the transient detector 101 in the encoder 1. The pre-inverse transform processing steps may be carried out by a separate (not shown) unit preceding an inverse transforming unit for some pre-existing coding scheme for adaptive block length media signals. For instance, the neural network system 201 (together with the pre-inverse transform processing) may be implemented together with any existing codec, e.g. to refine AC-4 transform coefficients, or may be used with a new codec designed for decoding with a neural network system 201.
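The pre-inverse transform processing may be sketched as the inverse of whichever conversion was applied. The example below assumes that short blocks were combined by interleaving (so that splitting is a de-interleaving) or up-sampled by repetition (so that down-sampling averages neighbouring coefficients), and that the block length information holds one original length per predicted block; these are assumptions for illustration, and the function names are hypothetical.

```python
from typing import List, Sequence


def split_interleaved(long_block: Sequence[float], num_short: int) -> List[List[float]]:
    """Undo an interleaving conversion: recover num_short short blocks."""
    return [list(long_block[i::num_short]) for i in range(num_short)]


def downsample(long_block: Sequence[float], factor: int) -> List[float]:
    """Undo an up-sampling by repetition by averaging groups of coefficients."""
    return [
        sum(long_block[i:i + factor]) / factor
        for i in range(0, len(long_block), factor)
    ]


def pre_inverse_transform(
    predicted_blocks: Sequence[Sequence[float]],
    block_length_info: Sequence[int],
    long_len: int = 256,
) -> List[List[float]]:
    """Restore the block lengths chosen by the encoder before the inverse transform."""
    restored: List[List[float]] = []
    for block, original_len in zip(predicted_blocks, block_length_info):
        if original_len == long_len:
            restored.append(list(block))          # long block: nothing to undo
        else:
            restored.extend(split_interleaved(block, long_len // original_len))
    return restored


print(pre_inverse_transform([[1, 3, 2, 4], [5, 6, 7, 8]], [2, 4], long_len=4))
# -> [[1, 2], [3, 4], [5, 6, 7, 8]]
```

After this step each block again has the length requested by the transient detector 101, so an existing adaptive block length inverse transform can be applied unchanged.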

In yet a further implementation, the inverse transform unit 202 transforms each predicted block (being of a static length) into the time domain as if the set of predicted blocks were from a static length media signal. In such implementations, the inverse transform unit 202 does not need to consider the block length information and the neural network system 201 effectively converts an adaptive block switching media signal to a static block length media signal. The neural network system 201 receives blocks of varying lengths and is trained to output fixed length blocks. The inverse transform unit 202 transforms the static length blocks to a time domain media signal.

The inverse transform unit 202 outputs a time domain media signal (or a sequence of time domain media signal blocks) suitable for playback by a playback device (not shown). The neural network system 201 is configured to receive at least one quantized transform coefficient in a block and predict at least one transform coefficient.

With reference to FIG. 2, an embodiment of the computer implemented neural network system 201 in FIG. 1 is depicted in more detail. The neural network system 201 is configured to receive a frame 20 of adaptive length blocks, each comprising a set of quantized transform coefficients {tilde over (X)}_(k) representing the frequency content of a partial time segment of a media signal, and block length information 21 indicating a number of quantized transform coefficients for each block in frame 20, the number of quantized transform coefficients being one of a first number or a second number. The computer implemented neural network system 201 further comprises a conversion stage 11 that is configured to determine that at least a first block has the second number of quantized transform coefficients, and convert at least the first block into a converted block having the first number of quantized transform coefficients. When frame 20, having at least one block with the second number of quantized transform coefficients, is provided to the conversion stage 11, the conversion stage 11 generates an output frame 20′ wherein the output blocks all have the first number of quantized transform coefficients.

The neural network system 201 further receives block length information 21 indicating a number of quantized transform coefficients for each block in frame 20. The block length information 21 thereby indicates the sequence of blocks comprising the first or second number of transform coefficients. The block length information 21 may be a sequence of integers or symbols, each integer or symbol representing a block and the value of each integer (or the type of symbol) representing the number of quantized transform coefficients {tilde over (X)}_(k) of that block.

The block length information 21 may comprise more than two alternative block lengths. In some implementations a block with the first number of transform coefficients X_(k) that precedes a block with the second number of transform coefficients X_(k) may be labelled as a bridge-in block and a block with the first number of transform coefficients X_(k) that succeeds a block with the second number of transform coefficients X_(k) may be labelled as a bridge-out block. Accordingly, the block length information 21 may be a sequence of four (or more) different integers, one for each of a long block (first number of transform coefficients X_(k)), a short block (with the second number of transform coefficients X_(k)), a bridge-in block and a bridge-out block.
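A sequence of four integer labels of the kind described above may, for example, be derived from the per-block lengths as follows. The integer values assigned to the labels, and the choice to label a long block lying between two short blocks as bridge-in, are assumptions made for this sketch.

```python
from typing import List, Sequence

# Hypothetical integer codes for the four block types.
LONG, SHORT, BRIDGE_IN, BRIDGE_OUT = 0, 1, 2, 3


def label_blocks(lengths: Sequence[int], long_len: int = 256) -> List[int]:
    """Map per-block coefficient counts to long/short/bridge-in/bridge-out labels."""
    labels: List[int] = []
    for i, n in enumerate(lengths):
        if n != long_len:
            labels.append(SHORT)
        elif i + 1 < len(lengths) and lengths[i + 1] != long_len:
            labels.append(BRIDGE_IN)    # long block preceding a short block
        elif i > 0 and lengths[i - 1] != long_len:
            labels.append(BRIDGE_OUT)   # long block succeeding a short block
        else:
            labels.append(LONG)
    return labels


print(label_blocks([256, 256, 128, 128, 256]))
# -> [0, 2, 1, 1, 3]
```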

The neural network system 201 forms at least one conditioning variable 15 based on conditioning information, wherein the conditioning information comprises at least two components, (i) information representing the converted block (or representing a block comprising the first number of quantized transform coefficients) and (ii) information representing the block length information 21. In a simple case, information representing the converted block is the quantized transform coefficients {tilde over (X)}_(k) per se, and the block length information representation is an integer. The at least one conditioning variable 15 and the main neural network 16 may feature a separate dimension for each piece of conditioning information or a single dimension onto which each piece of conditioning information is projected.

The at least one conditioning variable 15 is used to condition a main neural network 16. The main neural network 16 is trained to predict at least one output variable given at least one conditioning variable 15, and the at least one output variable is provided to an output neural network 17 trained to make a final prediction of at least one transform coefficient (i.e. outputting at least one predicted transform coefficient X _(k)) given at least one output variable from the main neural network 16. The output neural network 17 may comprise one or more hidden layers.

The main neural network 16 may be any type of neural network, e.g. a deep neural network, a recurrent neural network or any neural network system. The main neural network 16 may be a regressive model. The media signal may be any type of media signal including an audio or video signal. In case of the media signal being an audio signal, the main neural network 16 is in a preferred embodiment serving as a general audio generative model in the transform domain. The main neural network 16 is configured to operate in the transform domain and is trained to predict at least one output variable given at least one conditioning variable. The at least one output variable may be considered a hidden state and is provided to the output neural network 17, wherein the output neural network 17 is configured (e.g. trained) to output at least one predicted transform coefficient given the at least one output variable. The output neural network 17 may be implemented together with the main neural network 16 as a single unit, e.g. as an output stage of the main neural network 16, or as a separate neural network. Regardless, the output neural network 17 and the main neural network 16 exchange hidden state information.

The at least one transform coefficient X _(k) is thus predicted from the at least one quantized transform coefficient {tilde over (X)}_(k) by the main neural network 16 and the output neural network 17 by capturing temporal and/or frequency dependencies of the representation of the quantized transform coefficients. That is, the main neural network 16 and the output neural network 17 may be trained such that previous representations of transform coefficients having been processed by the main neural network 16 may influence the prediction of the current at least one transform coefficient. Additionally or alternatively, the main neural network 16 and output neural network 17 are trained such that interdependencies between transform coefficients in a current block and past blocks are considered. As the transform coefficients represent frequency content, the main neural network 16 and the output neural network 17 may be trained to predict at least one transform coefficient by learning how the frequency content (which is represented in the transform coefficients) of a first frequency band affects the frequency content of a second frequency band.
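The influence of previously processed blocks on the current prediction may be illustrated with a minimal recurrent cell followed by a linear output stage. This is only a sketch: the class TinyRecurrentPredictor, the tanh recurrence and the weight layout are hypothetical, and a practical main neural network 16 and output neural network 17 would be considerably larger.

```python
import math
from typing import List, Sequence


class TinyRecurrentPredictor:
    """Recurrent "main network" with a linear "output stage" (illustrative only)."""

    def __init__(
        self,
        w_in: Sequence[Sequence[float]],    # hidden x conditioning dimension
        w_rec: Sequence[Sequence[float]],   # hidden x hidden
        w_out: Sequence[Sequence[float]],   # number of coefficients x hidden
    ) -> None:
        self.w_in, self.w_rec, self.w_out = w_in, w_rec, w_out
        self.state = [0.0] * len(w_rec)     # memory carried between blocks

    def step(self, conditioning: Sequence[float]) -> List[float]:
        """Consume one conditioning variable and predict one block."""
        pre = [
            sum(a * c for a, c in zip(self.w_in[i], conditioning))
            + sum(r * s for r, s in zip(self.w_rec[i], self.state))
            for i in range(len(self.state))
        ]
        # The new hidden state plays the role of the "output variable".
        self.state = [math.tanh(p) for p in pre]
        # Output stage: map the hidden state to predicted coefficients.
        return [sum(o * s for o, s in zip(row, self.state)) for row in self.w_out]


net = TinyRecurrentPredictor([[1.0]], [[0.5]], [[1.0], [2.0]])
first = net.step([1.0])
second = net.step([1.0])   # same input, different output: the state has changed
print(first, second)
```

Feeding the same conditioning variable twice yields different predictions because the hidden state retained from the first block contributes to the second, which is the memory function referred to above.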

In some implementations the neural network system 201 further comprises an additional neural network, such as a conditioning neural network 12 connected to receive output from the conversion stage 11 and receive block length information from block length information neural network 14. The conditioning neural network 12 and the block length information neural network 14 are used to predict a respective piece of conditioning information and may be any type of neural network, e.g. a convolutional layer, and using one type does not necessitate the other type.

The conditioning neural network 12 and/or the block length information neural network 14 may be trained to predict a respective at least one output variable, where the at least one conditioning variable 15 is then obtained as the sum of the respective at least one predicted output variable. Further, the at least one conditioning variable 15 being passed to the main neural network 16 (being e.g. a sum of the respective at least one output variable from the conditioning neural network 12 and block length information neural network 14) may be regarded as a hidden neural network layer. Besides establishing an inner dimension (as a hyperparameter) for the hidden layer which matches the input dimension of the main neural network 16, the neural network system 201 may be operated (and trained) without any constraint on the interpretability of the hidden layer. For example, the conditioning information representing the quantized transform coefficients and the representation of the block length information may each be at least one output variable in the shape of matrices of a dimension matching the inner dimension. The at least one conditioning variable 15 may then be the sum of the at least one matrix output variable. In a further example, the matrices are two-dimensional and comprise a single row or column (i.e. a vector).
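The formation of the conditioning variable as a sum of two representations of matching inner dimension may be sketched as follows. Here a single linear layer stands in for the conditioning neural network 12 and a lookup table of vectors stands in for the block length information neural network 14; both simplifications, and all names used, are assumptions for this example.

```python
from typing import Dict, List, Sequence


def linear(
    x: Sequence[float], weights: Sequence[Sequence[float]], bias: Sequence[float]
) -> List[float]:
    """Dense layer: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def make_conditioning_variable(
    converted_block: Sequence[float],
    block_label: int,
    cond_weights: Sequence[Sequence[float]],
    cond_bias: Sequence[float],
    length_embedding: Dict[int, Sequence[float]],
) -> List[float]:
    """Sum the block representation and the block length representation."""
    block_repr = linear(converted_block, cond_weights, cond_bias)
    length_repr = length_embedding[block_label]
    if len(block_repr) != len(length_repr):
        raise ValueError("both representations must match the inner dimension")
    return [a + b for a, b in zip(block_repr, length_repr)]


# Inner dimension 3, block of 2 quantized coefficients, label 1 = short block.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
embedding = {0: [0.0, 0.0, 0.0], 1: [1.0, 1.0, 1.0]}
print(make_conditioning_variable([2.0, 3.0], 1, weights, bias, embedding))
# -> [3.0, 4.0, 6.0]
```

Because both representations share the inner dimension, the same converted coefficients produce a different conditioning variable depending on the block length label, which is what allows the main neural network 16 to respond differently to converted and unconverted blocks.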

The conditioning neural network 12 is trained to predict a representation of a block from output frame 20′ given the quantized transform coefficients {tilde over (X)}_(k) of the block. By predicting the representation of the quantized transform coefficients of the converted block, with a conditioning neural network 12 trained to predict the representation, a representation which further facilitates prediction by the main neural network 16 may be achieved. As opposed to assigning a static translation function for the quantized transform coefficients {tilde over (X)}_(k) that translates them into information representing the quantized transform coefficients {tilde over (X)}_(k), the conditioning neural network 12 may be trained to predict a representation which facilitates making the final prediction by the main neural network 16 and the output neural network 17.

In a similar manner, the block length information neural network 14 is trained to predict a representation of the block length information given block length information 21. By implementing a block length neural network 14 trained to predict a representation of the block length information given block length information 21 of at least the first block, the conditioning information used to condition the main neural network 16 will carry information indicating the number of quantized transform coefficients {tilde over (X)}_(k) in the first block in a format that facilitates prediction of at least one transform coefficient X̄_(k) by the main neural network 16 and the output neural network 17. In one example, the block length neural network 14 outputs a representation of the block length information which indicates a block with the first number of transform coefficients X_(k). Accordingly, the main neural network 16 is conditioned differently, and will respond differently, when the represented quantized transform coefficients {tilde over (X)}_(k) are from a converted block or from a quantized block with the first number of transform coefficients {tilde over (X)}_(k). As the main neural network 16 and output neural network 17 have been trained to predict at least one transform coefficient from information representing the quantized transform coefficients {tilde over (X)}_(k) together with conversion unit 11, the prediction of the at least one transform coefficient may be accomplished regardless of the manner in which the converted block was constructed from at least the first block.

As opposed to conditioning the block length neural network with e.g. an integer from a sequence of integers, some implementations of the neural network system 201 comprise a One-Hot encoder 13, which converts the block length information 21 to One-Hot vectors which in turn are used to condition the block length neural network 14. The block length information is categorical and indicates for each block a separate state (e.g. long, short, bridge-in or bridge-out). With One-Hot encoding, these categories are separated into individual vector elements, which facilitates the training and prediction of the block length neural network 14 by clearly distinguishing between the different possible states. For example, One-Hot encoding promotes a strong spatial dependence between the predicted at least one output variable and the input element of the input layer of the block length neural network that receives the hot (on-state) vector element.
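The One-Hot encoding of the categorical block length states can be sketched as follows. The state names follow the examples given above, while the ordering of the states and the function name `one_hot` are assumptions chosen for illustration.

```python
# Categorical block length states; the ordering is an arbitrary assumption.
BLOCK_STATES = ("long", "short", "bridge-in", "bridge-out")

def one_hot(state, states=BLOCK_STATES):
    """Return a vector with a single 1.0 at the index of the given state."""
    if state not in states:
        raise ValueError(f"unknown block length state: {state!r}")
    return [1.0 if s == state else 0.0 for s in states]

# Encode the block length information of a short sequence of blocks.
frame_states = ["bridge-in", "short", "short", "bridge-out"]
encoded = [one_hot(s) for s in frame_states]
```

Each state activates a different input element, in contrast to an integer code, which would place all states on a single input with an implied ordering.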

In some implementations the neural network system 201 further receives for each block perceptual model coefficients pEnvQ and/or a spectral envelope. The conditioning information may thus further include additional pieces of information that are a representation of perceptual model coefficient pEnvQ information and/or spectral envelope information. The perceptual model coefficients pEnvQ and/or spectral envelope may be processed in parallel with the block length information and the quantized transform coefficients and either combined with other information in the at least one conditioning variable 15 or provided as side information in a separate dimension, to the main neural network 16.

The set of perceptual model coefficients pEnvQ may be derived from a perceptual model, such as those occurring in the encoder. The perceptual model coefficients pEnvQ are computed per frequency band and are preferably mapped onto the same resolution as the frequency coefficients of a block to facilitate processing.

In implementations where a single short block has been converted to a converted block, the pEnvQ coefficients are converted to an equivalent long block representation by an analogous conversion procedure and used as conditioning information. For example, if a short block is up-sampled, the pEnvQ coefficients are up-sampled in the same way.
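A minimal sketch of the two steps just described is given below: per-band pEnvQ values are first mapped onto the coefficient resolution of a short block and then up-sampled in the same way as the block itself. The band edges, the example values and the choice of up-sampling by repetition are hypothetical and serve only to illustrate the idea.

```python
def bands_to_bins(band_values, band_edges):
    """Spread per-band values over bins; band b covers bins [edges[b], edges[b + 1])."""
    if len(band_edges) != len(band_values) + 1:
        raise ValueError("band_edges must have one more entry than band_values")
    bins = []
    for value, lo, hi in zip(band_values, band_edges, band_edges[1:]):
        bins.extend([value] * (hi - lo))
    return bins

def upsample_by_repetition(values, factor):
    """Repeat each value 'factor' times (one possible up-sampling choice)."""
    return [v for v in values for _ in range(factor)]

# Toy short block with 8 coefficients and 3 bands (hypothetical band layout).
penvq_per_band = [3.0, 5.0, 7.0]
penvq_short = bands_to_bins(penvq_per_band, band_edges=[0, 2, 4, 8])

# If the short block is up-sampled by 2, pEnvQ is up-sampled the same way.
penvq_long = upsample_by_repetition(penvq_short, factor=2)
```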

It is noted that when the neural network system 201 is said to be ‘trained’ in implementations featuring more than one neural network, all the neural networks in the system are, during at least a portion of the training, trained together. For example, the block length neural network 14 may be trained together with the main neural network 16, wherein the inner parameters (e.g. weights) of each neural network 14, 16 are modified to optimize some measure of the predicted at least one transform coefficient X̄_(k) compared to some target predicted at least one transform coefficient, such as the original non-quantized transform coefficients X_(k). The block length neural network 14 is then trained to output at least one conditioning variable 15 which brings the predicted at least one transform coefficient of the main neural network 16 and the output neural network 17 to resemble the original transform coefficients as closely as possible. The main neural network 16 and output neural network 17 are simultaneously trained to predict at least one transform coefficient X̄_(k) that resembles the original transform coefficients X_(k) as closely as possible.

The conversion in the conversion unit 11 of blocks with the second number of transform coefficients may involve the up-sampling of a block with the second number of quantized transform coefficients {tilde over (X)}_(k) to a converted block. Up-sampling may include linear or polynomial interpolation (and optionally extrapolation) of the second number of quantized transform coefficients to the first number of quantized transform coefficients. Alternatively, up-sampling to form a converted block may comprise one of: repeating each quantized transform coefficient a predetermined number of times, adding zero elements in between non-zero elements or interleaving the quantized transform coefficients {tilde over (X)}_(k). Alternatively, any other suitable up-sampling, expansion or interpolation technique is applicable. In some implementations the conversion unit 11 merely forwards the quantized transform coefficients {tilde over (X)}_(k) of a block to the main neural network 16, which is trained to predict at least one output parameter for the output neural network 17. In this case the main neural network 16 will learn to recognize a block with the second number of quantized transform coefficients {tilde over (X)}_(k) and absorb by training the functions of the converter.
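Three of the up-sampling options mentioned above may be sketched as follows. The function names are hypothetical, and the treatment of the last coefficient in the linear variant (holding its value) is an assumption, since the text leaves the extrapolation open.

```python
def upsample_repeat(block, factor):
    """Repeat each coefficient a predetermined number of times."""
    return [c for c in block for _ in range(factor)]

def upsample_zero_insert(block, factor):
    """Insert factor - 1 zero elements after each coefficient."""
    out = []
    for c in block:
        out.append(c)
        out.extend([0.0] * (factor - 1))
    return out

def upsample_linear(block, factor):
    """Linearly interpolate between neighbours; the last value is held."""
    out = []
    for i, c in enumerate(block):
        nxt = block[i + 1] if i + 1 < len(block) else c
        for j in range(factor):
            out.append(c + (nxt - c) * j / factor)
    return out

# Toy short block with 4 coefficients converted to 8 coefficients.
short_block = [1.0, 3.0, -1.0, 0.0]
repeated = upsample_repeat(short_block, 2)
zero_filled = upsample_zero_insert(short_block, 2)
interpolated = upsample_linear(short_block, 2)
```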

As an alternative to converting, in the conversion unit 11, a single first block comprising the second number of quantized transform coefficients {tilde over (X)}_(k) into a converted block, at least two blocks, a first block and a second block, each comprising the second number of quantized transform coefficients {tilde over (X)}_(k), may jointly be converted into a converted block comprising the first number of quantized transform coefficients {tilde over (X)}_(k). Accordingly, the main neural network 16 and output neural network 17 may be trained to predict at least one transform coefficient X̄_(k) given a representation of a converted block comprising a first number of quantized transform coefficients {tilde over (X)}_(k), where the quantized transform coefficients {tilde over (X)}_(k) of the converted block originate from the quantized transform coefficients {tilde over (X)}_(k) of at least the first and second block.

In general, the at least first and second blocks having the second number of quantized transform coefficients {tilde over (X)}_(k) may be N consecutive blocks having the second number of quantized transform coefficients {tilde over (X)}_(k), where the first number is a multiple N of the second number. The N consecutive blocks may then be converted to a converted block with the first number of quantized transform coefficients {tilde over (X)}_(k). The adaptive block switching media signal may, for example, include a first number of quantized transform coefficients {tilde over (X)}_(k) equal to 256 and a second number equal to 128, i.e. for N=2. A first number equal to 256 and N=4 would result in four short blocks, each comprising 64 quantized transform coefficients {tilde over (X)}_(k), being converted into one converted block. In yet a further example, N=8, when the first number of transform coefficients is 1024, then the second number of quantized transform coefficients {tilde over (X)}_(k) is 128.

Converting at least the first and second block into a converted block may comprise concatenating at least the first and the second block into a converted block. Concatenation is an efficient and easily implemented method of converting at least the first and second block into a converted block.
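Conversion by concatenation of N consecutive short blocks can be sketched as below, together with the relation between the first number, the second number and N. The helper names are hypothetical and the toy blocks are much shorter than the 128 or 256 coefficients used in the examples.

```python
def blocks_per_converted(first_number, second_number):
    """Return N such that first_number == N * second_number."""
    if first_number % second_number != 0:
        raise ValueError("the first number must be a multiple of the second number")
    return first_number // second_number

def concatenate_blocks(blocks):
    """Concatenate equally long short blocks into one converted block."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all short blocks must have the same length")
    return [c for block in blocks for c in block]

# The N values from the examples: 256/128, 256/64 and 1024/128.
n_values = [blocks_per_converted(256, 128),
            blocks_per_converted(256, 64),
            blocks_per_converted(1024, 128)]

# Toy case: two short blocks of 4 coefficients form one block of 8.
converted = concatenate_blocks([[1, 0, -2, 0], [0, 3, 0, -1]])
```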

In some implementations the conversion unit 11 receives for each block a representation of a respective time domain window function, where the window functions of the first and second blocks partially overlap.

The window functions may be received together with the quantized transform coefficients {tilde over (X)}_(k) or with the block length information 21 (being passed onto the conversion unit 11). Alternatively, the window functions may be constructed from the block length information 21 (being passed to the conversion unit 11). Or, the window functions may be constructed by determining the number of quantized transform coefficients {tilde over (X)}_(k) for a block in the conversion unit 11 by utilizing the correlation between the number of quantized transform coefficients in a block and the sequence of the blocks with at least the first and second numbers of quantized transform coefficients in each block. For example, a block with the first number of quantized transform coefficients {tilde over (X)}_(k) is associated with a long window function and a block with the second number of quantized transform coefficients {tilde over (X)}_(k) is associated with a short window function. In a further example, a block with the first number of quantized transform coefficients {tilde over (X)}_(k) may be associated with a bridge-in window function if this block precedes a block with the second number of quantized transform coefficients {tilde over (X)}_(k).
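One way of deriving a window type for each block from the sequence of block lengths alone is sketched below. The rule for a long block that lies directly between two runs of short blocks (treated here as bridge-in) is an assumption, as are the function and label names.

```python
def window_types(block_lengths, long_len, short_len):
    """Assign a window type to each block from the block length sequence."""
    types = []
    for i, n in enumerate(block_lengths):
        if n == short_len:
            types.append("short")
            continue
        if n != long_len:
            raise ValueError(f"unexpected block length: {n}")
        prev_short = i > 0 and block_lengths[i - 1] == short_len
        next_short = i + 1 < len(block_lengths) and block_lengths[i + 1] == short_len
        if next_short:
            types.append("bridge-in")   # long block preceding a short block
        elif prev_short:
            types.append("bridge-out")  # long block succeeding a short block
        else:
            types.append("long")
    return types

sequence = [256, 256, 128, 128, 256, 256]
types = window_types(sequence, long_len=256, short_len=128)
```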

In FIG. 2, all of the functions and units described as operating up-stream of the (optional) conditioning neural network 12 and the (optional) block length information neural network 14 may be referred to as a pre-processing unit or an adaptive block pre-processing unit. The pre-processing unit may thus be a multiple input multiple/single output unit, e.g. receiving block length information 21 and quantized transform coefficients {tilde over (X)}_(k) and outputting information representing the quantized transform coefficients {tilde over (X)}_(k) and representing the block length information 21 as separate pieces of information (at least one variable) or a combined piece of information (at least one variable).

With further reference to FIG. 6 there is depicted a flow chart illustrating a method for training the neural network system, for example the embodiment depicted in FIG. 2. At S311 a set of adaptive length target prediction (true) blocks are provided. This occurs alongside providing a set of training blocks being an impaired representation of the target prediction blocks (e.g. a quantized representation) at S321. The target prediction blocks comprise a non-quantized set of transform coefficients X_(k). The training blocks are provided to the neural network system 201 and processed such that a set of predicted blocks are obtained at S331. By comparing the outputted predicted blocks comprising the at least one predicted transform coefficient X̄_(k) with the target prediction blocks, a measure, e.g. of similarity, is obtained at S332. The measure may be an error measure, wherein a low error measure indicates a high level of similarity. The measure may be a negative likelihood, such as the negative log likelihood (NLL), wherein a low measure indicates a high level of similarity. The measure may be a Mean Absolute Error (MAE) or a Mean Square Error (MSE), where a high level of similarity will be indicated by a low MAE or MSE. At S333 the measure is used for modifying the weights of the neural network system 201 to reduce or minimize the measure.
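The training loop of S321 to S333 can be illustrated with a deliberately tiny stand-in for the neural network system: a single weight and bias that map a quantized coefficient to a predicted coefficient, updated by gradient descent on the MSE. The uniform quantizer, the step size, the learning rate and the scalar model are all assumptions for illustration; an actual system would use the networks described above and typically an automatic differentiation framework.

```python
def quantize(values, step):
    """Uniform quantizer used here only to create impaired training blocks."""
    return [step * round(v / step) for v in values]

def mse(targets, predictions):
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

targets = [0.9, -0.4, 0.2, -1.3, 0.6, 1.1, -0.7, 0.1]  # target block (toy values)
inputs = quantize(targets, step=0.5)                   # training block (S321)

weight, bias = 0.0, 0.0  # toy model parameters
learning_rate = 0.1
losses = []
for _ in range(200):
    predictions = [weight * q + bias for q in inputs]  # predicted block (S331)
    losses.append(mse(targets, predictions))           # measure (S332)
    errors = [p - t for p, t in zip(predictions, targets)]
    grad_w = 2.0 * sum(e * q for e, q in zip(errors, inputs)) / len(inputs)
    grad_b = 2.0 * sum(errors) / len(inputs)
    weight -= learning_rate * grad_w                   # weight update (S333)
    bias -= learning_rate * grad_b
```

The list `losses` records the measure at each iteration, so its decrease can be inspected directly.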

In one example, the measure is referred to as a loss function or ‘loss’, and is directly computed as the NLL as

$\begin{matrix}{\mathrm{Loss} = \mathrm{NLL}\left( X_{k},\overline{X}_{k} \right)} & (1)\end{matrix}$

In calculating the NLL loss the predicted at least one transform coefficient X̄_(k) is represented by at least one distribution parameter for the at least one predicted transform coefficient X̄_(k). The NLL function is thus applied to the at least one distribution parameter which represents the predicted at least one transform coefficient X̄_(k). The at least one distribution parameter parametrizes a probability distribution for the at least one predicted transform coefficient X̄_(k).

In other implementations the loss is calculated as the MSE according to:

$\begin{matrix}{\mathrm{Loss} = \frac{1}{K}\sum_{k=1}^{K}\left( X_{k} - \overline{X}_{k} \right)^{2}} & (2)\end{matrix}$

or the loss may be calculated as the MAE according to:

$\begin{matrix}{\mathrm{Loss} = \frac{1}{K}\sum_{k=1}^{K}\left| X_{k} - \overline{X}_{k} \right|.} & (3)\end{matrix}$

In calculating the MSE and MAE loss the at least one predicted transform coefficient X̄_(k) is used as such.
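The three loss alternatives may be sketched as follows. The MSE and MAE follow equations (2) and (3). For the NLL, the text does not specify the probability distribution, so the sketch assumes, purely for illustration, an independent Gaussian per coefficient parametrized by a mean and a log standard deviation; the function names are likewise hypothetical.

```python
import math

def mse_loss(target, predicted):
    """Mean square error, equation (2)."""
    return sum((x - p) ** 2 for x, p in zip(target, predicted)) / len(target)

def mae_loss(target, predicted):
    """Mean absolute error, equation (3)."""
    return sum(abs(x - p) for x, p in zip(target, predicted)) / len(target)

def gaussian_nll_loss(target, means, log_scales):
    """Mean negative log likelihood under independent Gaussians (an assumption).

    The network is taken to output distribution parameters (mean, log sigma)
    for each predicted coefficient rather than the coefficient itself.
    """
    total = 0.0
    for x, mu, log_sigma in zip(target, means, log_scales):
        sigma = math.exp(log_sigma)
        total += 0.5 * math.log(2.0 * math.pi) + log_sigma + 0.5 * ((x - mu) / sigma) ** 2
    return total / len(target)

target = [1.0, -2.0, 0.5]
predicted = [0.5, -1.0, 0.5]
loss_mse = mse_loss(target, predicted)
loss_mae = mae_loss(target, predicted)
loss_nll = gaussian_nll_loss(target, means=target, log_scales=[0.0, 0.0, 0.0])
```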

In some cases, a single predicted converted block may represent more than one training block (and the associated target prediction blocks). In such cases the predicted blocks may be inversely converted into blocks individually corresponding to a training block (and the associated target prediction block) such that the measure may be computed.
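For the case where the conversion was a concatenation of N short blocks, the inverse conversion amounts to splitting the predicted converted block, after which the measure can be computed per block. This is a sketch under that assumption; other conversion procedures would need their own inverse, and the names used are hypothetical.

```python
def split_converted_block(converted, num_blocks):
    """Inverse of concatenation: split into num_blocks equally long blocks."""
    if len(converted) % num_blocks != 0:
        raise ValueError("converted block length must be divisible by num_blocks")
    size = len(converted) // num_blocks
    return [converted[i * size:(i + 1) * size] for i in range(num_blocks)]

def mse(target, predicted):
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)

predicted_converted = [1.0, 0.0, -2.0, 0.0, 0.0, 3.0, 0.0, -1.0]
target_blocks = [[1.0, 0.0, -2.0, 0.0], [0.0, 2.0, 0.0, -1.0]]

predicted_blocks = split_converted_block(predicted_converted, num_blocks=2)
per_block_mse = [mse(t, p) for t, p in zip(target_blocks, predicted_blocks)]
```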

With reference to FIG. 3a there is illustrated a sequence of time domain window functions 31, 32a, 32b, 33. FIG. 3a illustrates the window sequence for a typical 2:1 block length switch. The first long window 31 is followed by two short windows 32a, 32b, which in turn are followed by a second long window 33. The short time domain window functions 32a, 32b may overlap by 50%, where adding the squared short window functions results in a value of one for the overlapping portion. Additionally, the squares of the window functions 31, 32a, 32b, 33 sum to one in every overlap.
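The property that the squared window functions sum to one in an overlap can be checked numerically. The sketch below uses a sine window, which is one common window with this property; the text does not state which window shape is used, so the sine window is an assumption.

```python
import math

def sine_window(length):
    """Sine window of the given (even) length; an assumed window shape."""
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]

def overlap_power_sum(window):
    """Sum of squares of two copies of the window overlapping by 50%.

    The second half of the earlier window coincides in time with the first
    half of the later window.
    """
    half = len(window) // 2
    return [window[n + half] ** 2 + window[n] ** 2 for n in range(half)]

short_window = sine_window(16)
power_sum = overlap_power_sum(short_window)
```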

In some implementations, the long windows 31, 33 may further be a bridge-in window 31 and a bridge-out window 33 respectively, especially adapted to respectively precede and succeed short windows 32a, 32b. The window functions 31, 32a, 32b, 33 are at least partially overlapping in time. Each window function 31, 32a, 32b, 33 is associated with a transform coefficient block: a long transform coefficient block with a long window function 31, 33, and a short transform coefficient block with a short window function 32a, 32b.

In some additional implementations, where the number of transform coefficients in each block is one out of more than two alternatives (e.g. one out of 256, 128 and 64 coefficients as mentioned above), the bridge-in window function 31 and the bridge-out window function 33 may comprise more than two bridging window functions, e.g. one for each type of transition between the variable length blocks. If the blocks have a length of one out of 256, 128 and 64 there may be defined an in and out bridging window function for each of: 256 to 128, 256 to 64 and 128 to 64.
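The set of bridging window functions needed for a given set of block lengths can be enumerated as in the following sketch, which lists one bridge-in and one bridge-out entry for every pair of a longer and a shorter length. The tuple layout and the function name are illustrative assumptions.

```python
from itertools import combinations

def bridging_transitions(block_lengths):
    """List (direction, longer, shorter) for every pair of distinct lengths."""
    ordered = sorted(set(block_lengths), reverse=True)
    return [(direction, longer, shorter)
            for longer, shorter in combinations(ordered, 2)
            for direction in ("in", "out")]

transitions = bridging_transitions([256, 128, 64])
```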

With further reference to FIG. 3b there is illustrated a long converted window 32 (with an associated long converted block) that is the result of a conversion of two short window functions 32a, 32b (and two short transform coefficient blocks).

By inverse transforming the quantized transform coefficients of a first and second (short) block (their respective window function is shown in FIG. 3a as 32a and 32b) back into a windowed time domain representation, they may be merged into a long converted block. This may be achieved by overlap adding the windowed time domain representation of the first and second blocks and transforming the overlap added time domain representation of the first and second blocks into a converted block having the first number of quantized transform coefficients.
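The inverse transform and overlap-add steps can be sketched as follows for the case of an MDCT with sine windows, both of which are assumptions made for this illustration. The sketch stops after the overlap-add: the final forward transform of the merged time domain segment into the converted block is not shown, and no quantization is applied. The direct O(M²) transform loops are chosen for readability only.

```python
import math

def sine_window(length):
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]

def mdct(frame, window):
    """Windowed MDCT: 2M time samples -> M coefficients."""
    m = len(frame) // 2
    return [sum(window[n] * frame[n]
                * math.cos(math.pi / m * (n + 0.5 + m / 2) * (k + 0.5))
                for n in range(2 * m))
            for k in range(m)]

def imdct(coeffs, window):
    """Windowed inverse MDCT: M coefficients -> 2M time-aliased samples."""
    m = len(coeffs)
    return [window[n] * (2.0 / m)
            * sum(coeffs[k] * math.cos(math.pi / m * (n + 0.5 + m / 2) * (k + 0.5))
                  for k in range(m))
            for n in range(2 * m)]

def overlap_add(segments, hop):
    """Add equally long segments whose start positions are 'hop' samples apart."""
    out = [0.0] * (hop * (len(segments) - 1) + len(segments[0]))
    for i, segment in enumerate(segments):
        for n, value in enumerate(segment):
            out[i * hop + n] += value
    return out

M = 4  # toy number of coefficients per short block
signal = [math.sin(0.7 * n) + 0.1 * n for n in range(3 * M)]
window = sine_window(2 * M)

# Two short blocks with 50% overlapping windows (cf. 32a and 32b).
short_blocks = [mdct(signal[i * M:i * M + 2 * M], window) for i in range(2)]

# Inverse transform each short block and overlap-add the windowed segments.
merged_time = overlap_add([imdct(b, window) for b in short_blocks], hop=M)

# In the region covered by both short windows the time aliasing cancels.
overlap_error = max(abs(merged_time[n] - signal[n]) for n in range(M, 2 * M))
```

Outside the doubly covered region the merged segment still contains windowed, time-aliased samples, which cancel against the neighbouring blocks in a full window sequence.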

For example, if the transform coefficients are Modified Discrete Cosine Transform (MDCT) coefficients, the intervening short blocks (associated with window functions 32a, 32b) may be merged into a single long block by inverting the MDCT to short time domain segments and overlap adding the short time domain segments. A DCT type 4 may then be used to compute transform coefficients of the equivalent converted long block 32 with a flat-top window. The window sequence after this merging/conversion operation is shown in FIG. 3b. It is further noted that this procedure of conversion may be accomplished while preserving perfect reconstruction properties of the transform coefficients (in the absence of quantization).

With reference to FIG. 4, there is depicted a flow chart illustrating a method for predicting at least one transform coefficient from quantized transform coefficients according to an embodiment of the invention. At S111, the neural network system receives a frame comprising quantized transform coefficients. The neural network system determines that at least one block of the frame comprises the second number of transform coefficients at S112 and proceeds by converting at least the block with the second number of transform coefficients into a converted block with the first number of transform coefficients at S113. Information representing the quantized transform coefficients of a converted block is one piece of information upon which at least one conditioning variable, used to condition the main neural network at S131, is based. Optionally, the method involves conditioning a conditioning neural network at S114 with information representing the quantized transform coefficients of a converted block and using the at least one output variable of the conditioning neural network to condition the main neural network at S131.

Further, the method involves receiving block length information at S121. A representation of the block length information is used as one piece of information for conditioning the main neural network at S131. Optionally, the block length information is used to first condition a block length neural network at S123 wherein the predicted at least one output variable of the block length neural network is used to condition the main neural network at S131. Also, some embodiments comprise One-Hot encoding of the block length information at S122, wherein the One-Hot encoded block length information is used to either condition the block length neural network at S123 or as information which is part of the information used to condition the main neural network at S131.

At S131, the main neural network predicts at least one output variable given the at least one conditioning variable, and the at least one output variable is provided to the output stage (e.g., an output neural network) at S132. The output stage at S132 predicts the at least one transform coefficient.

FIG. 5 depicts a flow chart illustrating a method for obtaining training blocks (training blocks for input and target predicted blocks for comparison with the output) for training a neural network system for predicting the transform coefficients of an adaptive block length media signal according to embodiments of the present invention. At S211, a set of transform blocks is obtained. For example, a batch of waveforms or a media signal has been divided into a set of time domain segments (e.g. forming a time domain frame) and each time domain segment has been transformed into a set of varying length transform blocks (e.g. a transform domain frame). Alternatively, a batch of waveforms or a media signal has been processed with a transient detector as described above to determine the length of each block. At S212 it is determined that a first block comprises the second number of transform coefficients and this block is converted at S213 to a converted block with a first number of transform coefficients. At S221, a target predicted block is obtained. The target predicted block obtained at S221 may be the converted block itself.

At S231 the converted block is quantized to form a quantized block. That is, the quantized block does not represent the complete information originally present in the determined first block; thus the quantized block may be referred to as an impaired block which the neural network should learn to use to predict a non-impaired block. At S232 a training block is obtained from the quantized block obtained at S231. The training block may be the quantized block as such. In some implementations, the further steps of using the training block as input to the neural network during training and using the target predicted block as the training target are included.

Blocks determined to comprise the first number of transform coefficients may be processed analogously to obtain training blocks and target predicted blocks, wherein the step S213 is omitted.
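The steps S212 to S232 for obtaining a training block and a target predicted block from one transform block can be sketched as below. The conversion by repetition and the uniform quantizer with a fixed step are assumptions chosen for brevity; the description leaves both the conversion procedure and the quantizer open, and the function names are hypothetical.

```python
def upsample_repeat(block, factor):
    """One possible conversion: repeat each coefficient 'factor' times."""
    return [c for c in block for _ in range(factor)]

def quantize(block, step):
    """Uniform quantizer standing in for the codec's quantization."""
    return [step * round(c / step) for c in block]

def make_training_pair(block, first_number, step):
    """Return (training_block, target_block) for one transform block."""
    if len(block) == first_number:
        converted = list(block)  # long block: conversion step S213 is omitted
    else:
        converted = upsample_repeat(block, first_number // len(block))  # S213
    target_block = converted                    # S221: the converted block itself
    training_block = quantize(converted, step)  # S231 and S232
    return training_block, target_block

short_block = [0.9, -0.4, 0.2, -1.3]
training_block, target_block = make_training_pair(short_block, first_number=8, step=0.5)
```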

In some implementations, a media signal or a batch of waveforms is processed with a transient detector which determines the transform length as discussed above. Thus, the set of transform blocks will contain all different types of blocks and window functions.

In the above, possible methods of training and operating a deep-learning-based system for predicting transform coefficients of an adaptive block length media signal, as well as possible implementations of such a system, have been described. Additionally, the present disclosure also relates to an apparatus for carrying out these methods. An example of such apparatus may comprise a processor (e.g., a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), one or more application specific integrated circuits (ASICs), one or more radio-frequency integrated circuits (RFICs), or any combination of these) and a memory coupled to the processor. The processor may be adapted to carry out some or all of the steps of the methods described throughout the disclosure.

The apparatus may be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that apparatus. Further, the present disclosure shall relate to any collection of apparatus that individually or jointly execute instructions to perform any one or more of the methodologies discussed herein.

The present disclosure further relates to a program (e.g., computer program) comprising instructions that, when executed by a processor, cause the processor to carry out some or all of the steps of the methods described herein.

Yet further, the present disclosure relates to a computer-readable (or machine-readable) storage medium storing the aforementioned program. Here, the term “computer-readable storage medium” includes, but is not limited to, data repositories in the form of solid-state memories, optical media, and magnetic media, for example.

Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the disclosure discussions utilizing terms such as “processing”, “computing”, “calculating”, “determining”, “analyzing” or the like, refer to the action and/or processes of a computer or computing system, or similar electronic computing devices, that manipulate and/or transform data represented as physical, such as electronic, quantities into other data similarly represented as physical quantities.

In a similar manner, the term “processor” may refer to any device or portion of a device that processes electronic data, e.g., from registers and/or memory to transform that electronic data into other electronic data that, e.g., may be stored in registers and/or memory. A “computer” or a “computing machine” or a “computing platform” may include one or more processors.

The methodologies described herein are, in one example embodiment, performable by one or more processors that accept computer-readable (also called machine-readable) code containing a set of instructions that when executed by one or more of the processors carry out at least one of the methods described herein. Any processor capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken is included. Thus, one example is a typical processing system that includes one or more processors. Each processor may include one or more of a CPU, a graphics processing unit, and a programmable DSP unit. The processing system further may include a memory subsystem including main RAM and/or a static RAM, and/or ROM. A bus subsystem may be included for communicating between the components. The processing system further may be a distributed processing system with processors coupled by a network. If the processing system requires a display, such a display may be included, e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT) display. If manual data entry is required, the processing system also includes an input device such as one or more of an alphanumeric input unit such as a keyboard, a pointing control device such as a mouse, and so forth. The processing system may also encompass a storage system such as a disk drive unit. The processing system in some configurations may include a sound output device, and a network interface device. The memory subsystem thus includes a computer-readable carrier medium that carries computer-readable code (e.g., software) including a set of instructions to cause performing, when executed by one or more processors, one or more of the methods described herein. Note that when the method includes several elements, e.g., several steps, no ordering of such elements is implied, unless specifically stated.
The software may reside in the hard disk, or may also reside, completely or at least partially, within the RAM and/or within the processor during execution thereof by the computer system. Thus, the memory and the processor also constitute computer-readable carrier medium carrying computer-readable code. Furthermore, a computer-readable carrier medium may form, or be included in a computer program product.

In alternative example embodiments, the one or more processors operate as a standalone device or may be connected, e.g., networked, to other processor(s). In a networked deployment, the one or more processors may operate in the capacity of a server or a user machine in a server-user network environment, or as a peer machine in a peer-to-peer or distributed network environment. The one or more processors may form a personal computer (PC), a tablet PC, a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.

Note that the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.

Thus, one example embodiment of each of the methods described herein is in the form of a computer-readable carrier medium carrying a set of instructions, e.g., a computer program that is for execution on one or more processors, e.g., one or more processors that are part of a web server arrangement. Thus, as will be appreciated by those skilled in the art, example embodiments of the present disclosure may be embodied as a method, an apparatus such as a special purpose apparatus, an apparatus such as a data processing system, or a computer-readable carrier medium, e.g., a computer program product. The computer-readable carrier medium carries computer readable code including a set of instructions that when executed on one or more processors cause the processor or processors to implement a method. Accordingly, aspects of the present disclosure may take the form of a method, an entirely hardware example embodiment, an entirely software example embodiment or an example embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a carrier medium (e.g., a computer program product on a computer-readable storage medium) carrying computer-readable program code embodied in the medium.

The software may further be transmitted or received over a network via a network interface device. While the carrier medium is in an example embodiment a single medium, the term “carrier medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term “carrier medium” shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by one or more of the processors and that cause the one or more processors to perform any one or more of the methodologies of the present disclosure. A carrier medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media includes, for example, optical, magnetic disks, and magneto-optical disks. Volatile media includes dynamic memory, such as main memory. Transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise a bus subsystem. Transmission media may also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications. For example, the term “carrier medium” shall accordingly be taken to include, but not be limited to, solid-state memories, a computer product embodied in optical and magnetic media; a medium bearing a propagated signal detectable by at least one processor of the one or more processors and representing a set of instructions that, when executed, implement a method; and a transmission medium in a network bearing a propagated signal detectable by at least one processor of the one or more processors and representing the set of instructions.

It will be understood that the steps of methods discussed are performed in one example embodiment by an appropriate processor (or processors) of a processing (e.g., computer) system executing instructions (computer-readable code) stored in storage. It will also be understood that the disclosure is not limited to any particular implementation or programming technique and that the disclosure may be implemented using any appropriate techniques for implementing the functionality described herein. The disclosure is not limited to any particular programming language or operating system.

Reference throughout this disclosure to “one example embodiment”, “some example embodiments” or “an example embodiment” means that a particular feature, structure or characteristic described in connection with the example embodiment is included in at least one example embodiment of the present disclosure. Thus, appearances of the phrases “in one example embodiment”, “in some example embodiments” or “in an example embodiment” in various places throughout this disclosure are not necessarily all referring to the same example embodiment. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner, as would be apparent to one of ordinary skill in the art from this disclosure, in one or more example embodiments.

As used herein, unless otherwise specified the use of the ordinal adjectives “first”, “second”, “third”, etc., to describe a common object, merely indicate that different instances of like objects are being referred to and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.

In the claims below and the description herein, any one of the terms comprising, comprised of or which comprises is an open term that means including at least the elements/features that follow, but not excluding others. Thus, the term comprising, when used in the claims, should not be interpreted as being limitative to the means or elements or steps listed thereafter. For example, the scope of the expression a device comprising A and B should not be limited to devices consisting only of elements A and B. Any one of the terms including or which includes or that includes as used herein is also an open term that also means including at least the elements/features that follow the term, but not excluding others. Thus, including is synonymous with and means comprising.

It should be appreciated that in the above description of example embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single example embodiment, Fig., or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claims require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed example embodiment. Thus, the claims following the Description are hereby expressly incorporated into this Description, with each claim standing on its own as a separate example embodiment of this disclosure.

Furthermore, while some example embodiments described herein include some but not other features included in other example embodiments, combinations of features of different example embodiments are meant to be within the scope of the disclosure, and form different example embodiments, as would be understood by those skilled in the art. For example, in the following claims, any of the claimed example embodiments can be used in any combination.

In the description provided herein, numerous specific details are set forth. However, it is understood that example embodiments of the disclosure may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Thus, while there has been described what are believed to be the best modes of the disclosure, those skilled in the art will recognize that other and further modifications may be made thereto without departing from the spirit of the disclosure, and it is intended to claim all such changes and modifications as fall within the scope of the disclosure. For example, any formulas given above are merely representative of procedures that may be used. Functionality may be added or deleted from the block diagrams and operations may be interchanged among functional blocks. Steps may be added or deleted to methods described within the scope of the present disclosure.

Various aspects of the present invention may be appreciated from the following list of enumerated example embodiments (EEEs):

EEE1. A method for predicting, with a computer implemented neural network system, at least one transform coefficient representing frequency content of an adaptive block length media signal, comprising the steps of:

receiving a block of a frame, each block of the frame comprising a set of quantized transform coefficients representing a partial time segment of said media signal,

receiving block length information indicating a number of quantized transform coefficients for each block of the frame, the number of quantized transform coefficients being one of a first number or a second number, wherein said first number is greater than said second number,

determining that at least a first block of the frame has said second number of quantized transform coefficients,

converting at least said first block into a converted block having said first number of quantized transform coefficients,

conditioning a main neural network trained to predict at least one output variable given at least one conditioning variable, the at least one conditioning variable being based on conditioning information, said conditioning information comprising a representation of said converted block and a representation of block length information for said first block;

providing said at least one output variable to an output stage configured to provide at least one predicted transform coefficient from said at least one output variable.

EEE2. The method of EEE 1, further comprising receiving a set of perceptual model coefficients for each block of the frame, and wherein the conditioning information further includes said set of perceptual model coefficients.

EEE3. The method of EEE 1, further comprising receiving a spectral envelope for each block in said frame, and wherein the conditioning information further includes said spectral envelope.

EEE4. The method of EEE 1, further comprising:

conditioning a block length neural network with said representation of the block length information for said first block, said block length neural network being trained to output said representation of the block length information for said first block given block length information.

EEE5. The method of EEE 4, wherein conditioning the block length neural network with said block length information comprises encoding said block length information as a one-hot vector and conditioning said block length neural network with said one-hot vector.
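
By way of a non-limiting illustration of EEE 5, the one-hot encoding of the block length information may be sketched as follows. The function name `one_hot_block_length` and the example values 1024 and 128 for the first and second number are assumptions made for this sketch only and are not taken from the disclosure.

```python
from typing import List, Sequence

# Assumed example values for the "first number" (long block) and the
# "second number" (short block); the disclosure does not fix them here.
ALLOWED_BLOCK_LENGTHS = (1024, 128)


def one_hot_block_length(
    num_coefficients: int,
    allowed: Sequence[int] = ALLOWED_BLOCK_LENGTHS,
) -> List[float]:
    """Encode the number of coefficients of a block as a one-hot vector."""
    if num_coefficients not in allowed:
        raise ValueError(f"unsupported block length: {num_coefficients}")
    # Exactly one entry is 1.0: the one matching the signalled block length.
    return [1.0 if n == num_coefficients else 0.0 for n in allowed]
```

The resulting vector has one entry per permitted block length and may then be supplied to the block length neural network as its conditioning input.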

EEE6. The method of EEE 1, further comprising the step:

conditioning a conditioning neural network with said quantized transform coefficients of said converted block, wherein the conditioning neural network is trained to output said representation of said converted block given quantized transform coefficients.

EEE7. The method of EEE 1, wherein converting at least said first block into said converted block comprises up-sampling said first block.
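
As a non-limiting illustration of EEE 7, one possible up-sampling is sketched below. The up-sampling rule is not specified in this EEE; repetition of each coefficient (nearest-neighbour up-sampling) is assumed here, and the name `upsample_block` is hypothetical.

```python
from typing import List, Sequence


def upsample_block(short_block: Sequence[float], first_number: int) -> List[float]:
    """Up-sample a short block to `first_number` coefficients by repetition.

    Assumption of this sketch: each coefficient of the short block is repeated
    first_number // len(short_block) times, so the frequency ordering is kept.
    """
    second_number = len(short_block)
    if second_number == 0 or first_number % second_number != 0:
        raise ValueError("first number must be a multiple of the short block length")
    factor = first_number // second_number
    return [c for c in short_block for _ in range(factor)]
```

With this rule a block of two coefficients `[1.0, 2.0]` up-sampled to four coefficients becomes `[1.0, 1.0, 2.0, 2.0]`.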

EEE8. The method of EEE 1, further comprising determining that at least said first block and a following second block have said second number of transform coefficients, and wherein converting at least said first block into said converted block comprises converting at least said first and second block into a converted block.

EEE9. The method according to any preceding EEE, wherein the quantized transform coefficients representing frequency content are Discrete Cosine Transform, DCT, coefficients.

EEE10. The method according to any preceding EEE further comprising:

receiving, by an inverse transform unit, said predicted transform coefficients and said block length information,

transforming said predicted transform coefficients into a time domain signal.

EEE11. The method according to EEE 8, wherein said first number is a multiple N of said second number and determining that at least said first block and said following second block have said second number of quantized transform coefficients comprises

determining that N consecutive blocks of the frame have said second number of quantized transform coefficients.
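
As a non-limiting illustration of EEE 11, the determination that N consecutive blocks are short blocks may be sketched as follows. The function name `find_short_block_groups` and the representation of the block length information as a plain list of coefficient counts are assumptions of this sketch.

```python
from typing import List, Sequence, Tuple


def find_short_block_groups(
    block_lengths: Sequence[int], first_number: int, second_number: int
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of runs of N consecutive short blocks.

    N is first_number // second_number; `end` is exclusive.
    """
    if second_number <= 0 or first_number % second_number != 0:
        raise ValueError("first number must be a multiple N of the second number")
    n = first_number // second_number
    groups: List[Tuple[int, int]] = []
    i = 0
    while i < len(block_lengths):
        window = block_lengths[i:i + n]
        if len(window) == n and all(b == second_number for b in window):
            groups.append((i, i + n))  # these N blocks can be converted together
            i += n
        else:
            i += 1
    return groups
```

Each returned group covers exactly N short blocks, which together carry as many coefficients as one long block.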

EEE12. The method according to EEE 8, wherein converting at least said first and second block into said converted block comprises concatenating at least said first and second block into a converted block.

EEE13. The method according to EEE 8, wherein receiving the block length information comprises:

receiving, for each block of the frame, a representation of a respective time domain window function, wherein the window functions of said first and second block partially overlap.

EEE14. The method according to EEE 13, wherein converting at least saidfirst and second block into said converted block comprises:

inverse transforming the quantized transform coefficients into a windowed time domain representation of the first and second block,

overlap-adding the windowed time domain representation of the first and second block,

transforming the overlap-added time domain representation of the first and second block into a converted block having said first number of quantized transform coefficients.
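
As a non-limiting illustration of the three steps of EEE 14, a minimal sketch is given below. Several choices in it are assumptions that are not stated in this EEE: an unnormalized MDCT is swapped in as the transform, sine windows are used for both block lengths, the overlap-added short-block signal is zero-padded centrally to the length of the long window, and the final re-quantization is omitted. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def sine_window(length: int) -> List[float]:
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]


def mdct(frame: Sequence[float]) -> List[float]:
    """Forward MDCT: 2M time samples -> M coefficients (no normalization)."""
    m = len(frame) // 2
    return [sum(frame[n] * math.cos(math.pi / m * (n + 0.5 + m / 2) * (k + 0.5))
                for n in range(2 * m)) for k in range(m)]


def imdct(coeffs: Sequence[float]) -> List[float]:
    """Inverse MDCT: M coefficients -> 2M time-aliased samples."""
    m = len(coeffs)
    return [sum(coeffs[k] * math.cos(math.pi / m * (n + 0.5 + m / 2) * (k + 0.5))
                for k in range(m)) / m for n in range(2 * m)]


def convert_short_blocks(short_blocks: Sequence[Sequence[float]]) -> List[float]:
    """Convert N short blocks of M coefficients into one block of N*M coefficients."""
    n_blocks, m = len(short_blocks), len(short_blocks[0])
    if any(len(b) != m for b in short_blocks):
        raise ValueError("all short blocks must have the same length")
    long_len = n_blocks * m
    w_short = sine_window(2 * m)
    # Steps 1 and 2: inverse transform, window, and overlap-add with hop size M.
    ola = [0.0] * ((n_blocks + 1) * m)
    for i, block in enumerate(short_blocks):
        for n, sample in enumerate(imdct(block)):
            ola[i * m + n] += w_short[n] * sample
    # Assumed layout: centre the overlap-added signal inside the long window.
    pad = (2 * long_len - len(ola)) // 2
    frame = [0.0] * pad + ola + [0.0] * (2 * long_len - len(ola) - pad)
    # Step 3: window and forward transform with the long block length.
    w_long = sine_window(2 * long_len)
    return mdct([w * x for w, x in zip(w_long, frame)])
```

For two short blocks of four coefficients each, `convert_short_blocks` returns eight coefficients, i.e. the first number of coefficients for N = 2.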

EEE15. A method for obtaining at least one training block for training a computer implemented neural network system to predict at least one transform coefficient of an adaptive block length media signal, comprising:

obtaining a set of transform blocks each comprising a number of transform coefficients representing frequency content of a media signal, the number of transform coefficients in each block being a first number or a second number, wherein the first number is greater than the second number,

determining that a first block comprises the second number of transform coefficients,

converting the first block into a converted block having the first number of transform coefficients,

obtaining a target predicted block from the converted block,

quantizing the converted block, and

obtaining a training block from the quantized converted block.
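
As a non-limiting illustration of EEE 15, the derivation of a training block and its target from one short block may be sketched as follows. The conversion by coefficient repetition, the uniform quantizer with step size `step`, and all function names are assumptions of this sketch and are not prescribed by this EEE.

```python
from typing import List, Sequence, Tuple


def convert_block(block: Sequence[float], first_number: int) -> List[float]:
    """Assumed conversion: repeat coefficients until the long block length is reached."""
    if len(block) == 0 or first_number % len(block) != 0:
        raise ValueError("first number must be a multiple of the block length")
    factor = first_number // len(block)
    return [c for c in block for _ in range(factor)]


def quantize(block: Sequence[float], step: float) -> List[float]:
    """Assumed uniform quantizer: round each coefficient to a multiple of `step`."""
    return [step * round(c / step) for c in block]


def make_training_pair(
    first_block: Sequence[float], first_number: int, step: float
) -> Tuple[List[float], List[float]]:
    """Return (training_block, target_block) for one short block."""
    converted = convert_block(first_block, first_number)
    target = list(converted)               # target is taken before quantization
    training = quantize(converted, step)   # training input is the impaired version
    return training, target
```

The target block keeps the unquantized converted coefficients, while the training block is their quantized, and thereby impaired, counterpart.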

EEE16. A computer implemented neural network system for predicting transform coefficients representing frequency content of an adaptive block length media signal, said neural network system comprising:

an adaptive block pre-processing unit configured to:

-   receive a frame comprising a set of quantized transform coefficients representing a partial time segment of a media signal,
-   receive block length information indicating a number of quantized transform coefficients for each block in said frame, the number of quantized transform coefficients being one of a first number or a second number, wherein said first number is greater than said second number,
-   determine that at least a first block has said second number of transform coefficients, and
-   convert at least said first block into a converted block having said first number of quantized transform coefficients,

a main neural network, wherein said main neural network is trained to predict at least one output variable given at least one conditioning variable based on conditioning information, said conditioning information comprising a representation of said converted block and a representation of block length information for said first block, and

an output stage, configured to provide at least one predicted transform coefficient from said at least one output variable.

EEE17. A neural network decoder, comprising the computer implemented neural network system according to EEE 16.

EEE18. A neural network decoder according to EEE 17, further comprising an inverse transform unit,

said inverse transform unit being configured to:

-   receive said at least one predicted transform coefficient and block length information, and
-   transform said at least one predicted transform coefficient to a time domain signal.

EEE19. The neural network system according to EEE 16, wherein said neural network system has been trained by:

providing a set of target prediction blocks,

providing, to said adaptive block pre-processing unit, a set of training blocks comprising at least one training block with said first number of transform coefficients and at least one training block with said second number of transform coefficients, the set of training blocks being an impaired representation of said set of target prediction blocks,

obtaining, from said output stage, a set of predicted blocks from said set of training blocks,

computing a measure of the set of predicted blocks with respect to said set of target prediction blocks,

modifying the weights of said neural network system to decrease the measure.

EEE20. The neural network system according to EEE 19, wherein said measure is one of a negative likelihood, a mean square error or an absolute error.
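
As a non-limiting illustration of the measures named in EEE 20, the following sketch computes a mean square error, a mean absolute error and a negative log-likelihood for one predicted block against its target block. The Gaussian model with a fixed standard deviation `sigma` used for the likelihood, the averaging over coefficients, and the function names are assumptions of this sketch.

```python
import math
from typing import Sequence


def _check(predicted: Sequence[float], target: Sequence[float]) -> None:
    if len(predicted) != len(target) or len(predicted) == 0:
        raise ValueError("blocks must be non-empty and of equal length")


def mean_square_error(predicted: Sequence[float], target: Sequence[float]) -> float:
    _check(predicted, target)
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def mean_absolute_error(predicted: Sequence[float], target: Sequence[float]) -> float:
    _check(predicted, target)
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


def gaussian_negative_log_likelihood(
    predicted: Sequence[float], target: Sequence[float], sigma: float = 1.0
) -> float:
    """Negative log-likelihood of the target under N(predicted, sigma^2) (assumed model)."""
    _check(predicted, target)
    squared = sum(((t - p) / sigma) ** 2 for p, t in zip(predicted, target))
    return 0.5 * squared + len(predicted) * math.log(sigma * math.sqrt(2.0 * math.pi))
```

Any of the three values can serve as the measure that the weight update of EEE 19 seeks to decrease.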

1-21. (canceled)

22. A method for predicting, with a computer implemented neural network system, at least one transform coefficient representing frequency content of an adaptive block length media signal, comprising the steps of: receiving a frame including one or more blocks, each block of the frame comprising a set of quantized transform coefficients representing a partial time segment of said media signal, receiving block length information indicating a number of quantized transform coefficients for each block of the frame, the number of quantized transform coefficients being one of a first number or a second number, wherein said first number is greater than said second number, determining that at least a first block of the frame has said second number of quantized transform coefficients, converting at least said first block into a converted block having said first number of quantized transform coefficients, conditioning a main neural network trained to predict at least one output variable given at least one conditioning variable, the at least one conditioning variable being based on conditioning information, said conditioning information comprising a representation of said converted block and a representation of block length information for said first block, providing said at least one output variable to an output stage configured to provide at least one predicted transform coefficient from said at least one output variable.

23. The method according to claim 22, further comprising receiving a set of perceptual model coefficients for each block of the frame, and wherein the conditioning information further includes said set of perceptual model coefficients.

24. The method according to claim 22, further comprising receiving a spectral envelope for each block in said frame, and wherein the conditioning information further includes said spectral envelope.

25. The method according to claim 22, further comprising: conditioning a block length neural network with said representation of the block length information for said first block, said block length neural network being trained to output said representation of the block length information for said first block given block length information.

26. The method according to claim 25, wherein conditioning the block length neural network with said block length information comprises encoding said block length information as a one-hot vector and conditioning said block length neural network with said one-hot vector.

27. The method according to claim 22, further comprising the step: conditioning a conditioning neural network with said quantized transform coefficients of said converted block, wherein the conditioning neural network is trained to output said representation of said converted block given quantized transform coefficients.

28. The method according to claim 22, wherein converting at least said first block into said converted block comprises up-sampling said first block.

29. The method according to claim 22, wherein the quantized transform coefficients representing frequency content are Discrete Cosine Transform, DCT, coefficients.

30. The method according to claim 22, further comprising: receiving, by an inverse transform unit, said predicted transform coefficients and said block length information, transforming said predicted transform coefficients into a time domain signal.

31. The method according to claim 22, further comprising determining that at least said first block and a following second block have said second number of transform coefficients, and wherein converting at least said first block into said converted block comprises converting at least said first and second block into a converted block.

32. The method according to claim 31, wherein said first number is a multiple N of said second number and determining that at least said first block and said following second block have said second number of quantized transform coefficients comprises determining that N consecutive blocks of the frame have said second number of quantized transform coefficients.

33. The method according to claim 31, wherein converting at least said first and second block into said converted block comprises concatenating at least said first and second block into a converted block.

34. The method according to claim 31, wherein receiving the block length information comprises: receiving, for each block of the frame, a representation of a respective time domain window function, wherein the window function of said first and second block partially overlap.

35. The method according to claim 34, wherein converting at least said first and second block into said converted block comprises: inverse transforming the quantized transform coefficients into a windowed time domain representation of the first and second block, overlap-adding the windowed time domain representation of the first and second block, transforming the overlap-added time domain representation of the first and second block into a converted block having said first number of quantized transform coefficients.

36. A method for obtaining at least one training block for training a computer implemented neural network system to predict at least one transform coefficient of an adaptive block length media signal, comprising: obtaining a set of transform blocks each comprising a number of transform coefficients representing frequency content of a media signal, the number of transform coefficients in each block being a first number or a second number, wherein the first number is greater than the second number, determining that a first block comprises the second number of transform coefficients, converting the first block into a converted block having the first number of transform coefficients, obtaining a target predicted block from the converted block, quantizing the converted block, and obtaining a training block from the quantized converted block.

37. A computer implemented neural network system for predicting transform coefficients representing frequency content of an adaptive block length media signal, said neural network system comprising: an adaptive block pre-processing unit configured to: receive a frame including one or more blocks, each block of the frame comprising a set of quantized transform coefficients representing a partial time segment of a media signal, receive block length information indicating a number of quantized transform coefficients for each block in said frame, the number of quantized transform coefficients being one of a first number or a second number, wherein said first number is greater than said second number, determine that at least a first block has said second number of transform coefficients, and convert at least said first block into a converted block having said first number of quantized transform coefficients, a main neural network, wherein said main neural network is trained to predict at least one output variable given at least one conditioning variable based on conditioning information, said conditioning information comprising a representation of said converted block and a representation of block length information for said first block, and an output stage, configured to provide at least one predicted transform coefficient from said at least one output variable.

38. The neural network system according to claim 37, wherein said neural network system has been trained by: providing a set of target prediction blocks, providing, to said adaptive block pre-processing unit, a set of training blocks comprising at least one training block with said first number of transform coefficients and at least one training block with said second number of transform coefficients, the set of training blocks being an impaired representation of said set of target prediction blocks, obtaining, from said output stage, a set of predicted blocks from said set of training blocks, computing a measure of the set of predicted blocks with respect to said set of target prediction blocks, modifying the weights of said neural network system to decrease the measure.

39. The neural network system according to claim 38, wherein said measure is one of a negative likelihood, a mean square error or an absolute error.

40. A neural network decoder, comprising the computer implemented neural network system according to claim 37.

41. A neural network decoder according to claim 40, further comprising an inverse transform unit, said inverse transform unit being configured to: receive said at least one predicted transform coefficient and block length information, and transform said at least one predicted transform coefficient to a time domain signal.